Find me a cluster

One solution. You may have yours.

import geopandas as gpd
import seaborn as sns
from libpysal import graph
from sklearn import cluster, preprocessing
chicago = gpd.read_file(
    "https://martinfleischmann.net/sds/chapter_07/data/chicago_influenza_1918.geojson"
)

Before working with clustering, do you remember that note about data standardisation? The demographic variables in the table are not using the same scale, so you need to do something about it before using K-means.

demographics = [
    "gross_acres",
    "illit",
    "unemployed_pct",
    "ho_pct",
    "agecat1",
    "agecat2",
    "agecat3",
    "agecat4",
    "agecat5",
    "agecat6",
    "agecat7",
]
chicago[demographics] = preprocessing.robust_scale(chicago[demographics])
chicago.head(2)
geography_code gross_acres illit unemployed_pct ho_pct agecat1 agecat2 agecat3 agecat4 agecat5 agecat6 agecat7 influenza geometry
0 G17003100388 -0.194708 0.465558 0.121762 -0.577117 -0.145658 -0.430586 -0.320059 -0.367526 -0.419623 -0.304649 -0.397324 29 POLYGON ((358405.051 570342.347, 358371.811 57...
1 G17003100197 -0.201697 2.413302 -0.701498 -0.629427 0.627451 0.434924 0.408555 0.358749 -0.009057 -0.143917 -0.205352 30 POLYGON ((356903.353 580393.561, 356895.319 58...

If you check the values now, you will see that they are all distributed around 0.

_ = sns.pairplot(chicago[demographics])

Pick a number of clusters

n_clusters = 4

Run K-Means for that number of clusters

kmeans = cluster.KMeans(n_clusters=n_clusters, random_state=0, n_init=1000)
kmeans.fit(chicago[demographics])
KMeans(n_clusters=4, n_init=1000, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Plot the different clusters on a map

chicago["cluster"] = kmeans.labels_

chicago.explore("cluster", categorical=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

Analyse the results: - What do you find? - What are the main characteristics of each cluster? - How are clusters distributed geographically? - Can you identify some groups concentrated on particular areas?

groups = chicago.groupby('cluster')
groups.size()
cluster
0    356
1     36
2     99
3      5
dtype: int64
groups[demographics].mean().T
cluster 0 1 2 3
gross_acres 0.191129 9.231985 1.214300 28.355467
illit 0.136419 0.374769 2.098371 0.290736
unemployed_pct -0.185173 0.217657 -0.190443 0.278034
ho_pct -0.004221 1.288033 0.097141 1.885749
agecat1 -0.130173 0.179272 1.498118 0.956863
agecat2 -0.146134 0.290070 1.586965 1.138612
agecat3 -0.157950 0.306211 1.508939 1.128909
agecat4 -0.168607 0.164381 1.276251 0.918705
agecat5 -0.153945 -0.101216 1.146712 0.677434
agecat6 -0.072975 -0.119848 0.921775 0.542829
agecat7 0.006919 -0.171417 0.789859 0.589063

Create spatially lagged K-Means.

queen = graph.Graph.build_contiguity(chicago).transform("r")

for column in demographics:
    chicago[column + "_lag"] = queen.lag(chicago[column])

demographics_spatial = demographics + [column + "_lag" for column in demographics]

kmeans_lag = cluster.KMeans(n_clusters=n_clusters, random_state=42, n_init=1000)
kmeans_lag.fit(chicago[demographics_spatial])

chicago["cluster_lag"] = kmeans_lag.labels_

chicago.explore("cluster_lag", categorical=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

Develop a regionalisation using agglomerative clustering

agg = cluster.AgglomerativeClustering(n_clusters=n_clusters, connectivity=queen.sparse)
agg.fit(chicago[demographics])
AgglomerativeClustering(connectivity=<496x496 sparse array of type '<class 'numpy.float64'>'
    with 2274 stored elements in Compressed Sparse Row format>,
                        n_clusters=4)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
chicago["cluster_agg"] = agg.labels_

chicago.explore("cluster_agg", categorical=True)
Make this Notebook Trusted to load map: File -> Trust Notebook

Generate a geography that contains only the boundaries of each region and visualise it.

regions = chicago[["cluster_agg", "geometry"]].dissolve("cluster_agg")
regions.reset_index().explore("cluster_agg", categorical=True)
Make this Notebook Trusted to load map: File -> Trust Notebook